AISHELL-4: An Open Source Dataset for Speech Enhancement, Separation, Recognition and Speaker Diarization in Conference Scenario

Fu, Yihui; Cheng, Luyao; Lv, Shubo; Jv, Yukai; Kong, Yuxiang; Chen, Zhuo; Hu, Yanxin; Xie, Lei; Wu, Jian; Bu, Hui; Xu, Xin; Du, Jun; Chen, Jingdong

arXiv:2104.03603 (cs)

[Submitted on 8 Apr 2021 (v1), last revised 10 Aug 2021 (this version, v4)]

Abstract: In this paper, we present AISHELL-4, a sizable real-recorded Mandarin speech dataset collected with an 8-channel circular microphone array for speech processing in conference scenarios. The dataset consists of 211 recorded meeting sessions, each containing 4 to 8 speakers, with a total length of 120 hours. This dataset aims to bridge advanced research on multi-speaker processing and practical application scenarios in three aspects. With real recorded meetings, AISHELL-4 provides realistic acoustics and rich natural speech characteristics of conversation, such as short pauses, speech overlap, quick speaker turns, and noise. Meanwhile, accurate transcriptions and speaker voice activity are provided for each meeting in AISHELL-4. This allows researchers to explore different aspects of meeting processing, ranging from individual tasks such as speech front-end processing, speech recognition, and speaker diarization to multi-modality modeling and joint optimization of relevant tasks. Given that most open-source datasets for multi-speaker tasks are in English, AISHELL-4 is the only Mandarin dataset for conversational speech, providing additional value for data diversity in the speech community. We also release a PyTorch-based training and evaluation framework as a baseline system to promote reproducible research in this field.

Comments:	Accepted by Interspeech 2021
Subjects:	Sound (cs.SD); Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2104.03603 [cs.SD]
	(or arXiv:2104.03603v4 [cs.SD] for this version)
	https://doi.org/10.48550/arXiv.2104.03603

Submission history

From: Yihui Fu
[v1] Thu, 8 Apr 2021 08:38:44 UTC (349 KB)
[v2] Fri, 18 Jun 2021 02:36:19 UTC (349 KB)
[v3] Wed, 14 Jul 2021 17:06:45 UTC (402 KB)
[v4] Tue, 10 Aug 2021 09:15:00 UTC (403 KB)
